INTERSPEECH.2007 - Speech Synthesis

Total: 22

#1 An HMM-based speech synthesis system applied to German and its adaptation to a limited set of expressive football announcements

Authors: Sacha Krstulović ; Anna Hunecke ; Marc Schröder

The paper assesses the capability of an HMM-based TTS system to produce German speech. The results are discussed in qualitative terms, and compared over three different choices of context features. In addition, the system is adapted to a small set of football announcements, in an exploratory attempt to synthesise expressive speech. We conclude that the HMMs are able to produce highly intelligible neutral German speech, with a stable quality, and that the expressivity is partially captured in spite of the small size of the football dataset.

#2 Statistical vowelization of Arabic text for speech synthesis in speech-to-speech translation systems

Authors: Liang Gu ; Wei Zhang ; Lazkin Tahir ; Yuqing Gao

Vowelization presents a principal difficulty in building text-to-speech synthesizers for speech-to-speech translation systems. In this paper, a novel log-linear modeling method is proposed that takes into account vowel and diacritical information at both the word level and the character level. A unique syllable-based normalization algorithm is then introduced to enhance both word coverage and data consistency. A recursive data generation and model training scheme is further devised to jointly optimize speech synthesizers and vowelizers for an English-Arabic speech translation system. The diacritization error rate is reduced by over 50% in vowelization experiments.

#3 A pair-based language model for the robust lexical analysis in Chinese text-to-speech synthesis

Authors: Wu Liu ; Dezhi Huang ; Yuan Dong ; Xinnian Mao ; Haila Wang

This paper presents a robust method of lexical analysis for Chinese text-to-speech (TTS) synthesis using a pair-based language model (LM). The traditional approach to Chinese lexical analysis simply treats word segmentation and part-of-speech (POS) tagging as two separate phases, each with its own algorithms and models. In fact, POS information is useful for word segmentation, and vice versa. Therefore, a pair-based language model is proposed to integrate basic word segmentation, POS tagging and named entity (NE) identification into a unified framework. An objective evaluation indicates that the proposed method achieves top-level performance and confirms its effectiveness in Chinese lexical analysis.

#4 A trainable excitation model for HMM-based speech synthesis

Authors: R. Maia ; Tomoki Toda ; Heiga Zen ; Yoshihiko Nankaku ; Keiichi Tokuda

This paper introduces a novel excitation approach for speech synthesizers in which the final waveform is generated from parameters directly obtained from Hidden Markov Models (HMMs). Despite the attractiveness of the HMM-based speech synthesis technique, namely its utilization of small corpora and its flexibility in achieving different voice styles, synthesized speech presents a characteristic buzziness caused by the simple excitation model employed during speech production. This paper presents an innovative scheme in which mixed excitation is modeled through closed-loop training of a set of state-dependent filters and pulse trains, minimizing the error between excitation and residual sequences. The proposed method proves effective, yielding synthesized speech with quality far superior to the simple excitation baseline and comparable to the best excitation schemes reported thus far for HMM-based speech synthesis.

#5 Cross-language phonemisation in German text-to-speech synthesis

Authors: Jochen Steigner ; Marc Schröder

We present a TTS component for transcribing English words in German text. In addition to loan words, whose form does not change, we also cover xenomorphs, English stems with German morphology. We motivate the need for such a processing component, and present the algorithm in some detail. In an evaluation on unseen material, we find a precision of 0.85 and a recall of 0.997.

#6 Preliminary experiments toward automatic generation of new TTS voices from recorded speech alone

Authors: Ryuki Tachibana ; Tohru Nagano ; Gakuto Kurata ; Masafumi Nishimura ; Noboru Babaguchi

To generate a new concatenative text-to-speech (TTS) voice from recordings of a human's voice, not only recordings but also additional information such as the transcriptions, prosodic labels, and the phonemic alignments are necessary. Since some of the information depends on the speaking style of the narrator, these types of information need to be manually added by listening to the recordings, which is costly and time consuming. To tackle this problem, we have been working on a totally trainable TTS system every component of which, including the text processing module, can be automatically trained from a speech corpus. In this paper, we refine the framework and propose several submodules to collect all of the linguistic and acoustic information necessary for generating a TTS voice from the recorded speech. Though completely automatic generation of a new voice is not yet possible, we report progress in the submodules by showing experimental results.

#7 Implementation and evaluation of an HMM-based Thai speech synthesis system

Authors: Suphattharachai Chomphan ; Takao Kobayashi

This paper describes a novel approach to the realization of Thai speech synthesis. Spectrum, pitch, and phone duration are modeled simultaneously in a unified HMM framework, and their parameter distributions are clustered independently using a decision-tree based context clustering technique under several clustering styles. A group of contextual factors that affect spectrum, pitch, and state duration, such as tone type and part of speech, is taken into account, which is particularly important for a tonal language. Evaluation of the synthesized speech shows that tone correctness is significantly improved under some clustering styles; moreover, the implemented system gives better reproduction of prosody (or naturalness, in some sense) than a unit-selection-based system built on the same speech database.

#8 Speech synthesis enhancement in noisy environments

Authors: Davide Bonardo ; Enrico Zovato

This paper reports on recent work to improve the intelligibility of synthesized speech in noisy environments. Text-To-Speech (TTS) technologies are nowadays used in many embedded devices such as mobile phones, PDAs, and car navigation systems. This means that speech can be produced in different types of environments where background noise can significantly degrade the perception of the synthetic message and consequently its intelligibility.

#9 Tagging syllable boundaries with joint n-gram models

Authors: Helmut Schmid ; Bernd Möbius ; Julia Weidenkaff

This paper presents a statistical method for the segmentation of words into syllables which is based on a joint n-gram model. Our system assigns syllable boundaries to phonetically transcribed words. The syllabification task was formulated as a tagging task. The syllable tagger was trained on syllable-annotated phone sequences. In an evaluation using ten-fold cross-validation, the system correctly predicted the syllabification of German words with an accuracy by word of 99.85%, which clearly exceeds results previously reported in the literature. The best performance was observed for a context size of five preceding phones. A detailed qualitative error analysis suggests that a further reduction of the error rate by up to 90% is possible by eliminating inconsistencies in the training database.
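
The tagging formulation can be illustrated with a minimal sketch. The data, the one-phone context, and the greedy decoding below are all simplifications for illustration, not the paper's actual joint n-gram model (which uses up to five preceding phones and proper decoding):

```python
from collections import defaultdict

# Toy syllable-annotated training data: each phone carries tag 'B'
# (syllable boundary before this phone) or 'I' (syllable-internal).
# Hypothetical examples; the paper trains on German lexicon data.
train = [
    [("h", "B"), ("a", "I"), ("l", "B"), ("o", "I")],
    [("m", "B"), ("a", "I"), ("m", "B"), ("a", "I")],
    [("l", "B"), ("o", "I"), ("m", "B"), ("a", "I")],
]

# Count joint (previous phone, phone, tag) events.
counts = defaultdict(lambda: defaultdict(int))
for word in train:
    prev = "<s>"
    for phone, tag in word:
        counts[(prev, phone)][tag] += 1
        prev = phone

def tag_syllables(phones):
    """Greedy best-tag decoding with a one-phone context: a stand-in
    for the joint n-gram tagging described in the paper."""
    tags, prev = [], "<s>"
    for p in phones:
        dist = counts.get((prev, p), {"B": 1})  # back off to 'B'
        tags.append(max(dist, key=dist.get))
        prev = p
    return tags

print(tag_syllables(["m", "a", "l", "o"]))  # -> ['B', 'I', 'B', 'I']
```

The tag sequence directly encodes the syllabification: a new syllable opens at every 'B'.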

#10 Hierarchical non-uniform unit selection based on prosodic structure

Authors: Jun Xu ; Dezhi Huang ; Yongxin Wang ; Yuan Dong ; Lianhong Cai ; Haila Wang

In speech synthesis systems based on waveform concatenation, using longer units can generate more natural synthetic speech. In order to improve the usage of longer units in the corpus, this paper proposes a hierarchical non-uniform unit selection framework. Each layer in the framework is an independent search procedure that searches for units of a different size and adopts naturalness measures suited to that unit type. We have applied the framework to our Mandarin speech synthesis system according to the Chinese prosodic structure, informed by statistics from our corpus. Experimental results show that it outperforms our previous system.

#11 Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets

Author: Peter Birkholz

We present a novel approach to the generation of speech movements for an articulatory speech synthesizer. The movements of the articulators are modeled by dynamical third order linear systems that respond to sequences of simple motor commands. The motor commands are derived automatically from a high level schedule for the input phonemes. The proposed model considers velocity differences of the articulators and accounts for coarticulation between vowels and consonants. Preliminary tests of the model in the framework of an articulatory speech synthesizer indicate its potential to produce realistic speech movements and thereby to contribute to a higher quality of the synthesized speech.

#12 A preselection method based on cost degradation from the optimal sequence for concatenative speech synthesis

Authors: Nobuyuki Nishizawa ; Hisashi Kawai

A novel unit preselection criterion for concatenative speech synthesis is proposed. To reduce the computational cost of unit selection, units that are unlikely to be selected should be pruned by preselection before the Viterbi search. Since the criterion is defined as the difference between the cost of the locally optimal sequence in which a unit is fixed and that of the globally optimal sequence, not only the target cost but also the concatenation cost can be taken into account in preselection. For real-time speech synthesis, a preselection method using decision trees, in which a unit can be bound to multiple nodes of a tree, is also introduced. Results of a unit selection experiment show that the proposed method, using decision trees built from 8 hours of training data, yields lower costs for the selected units than conventional online preselection based on target costs. The experimental results also show that the method is more effective when the computational budget is tightly limited.
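
The cost-degradation criterion can be sketched on a toy lattice (all costs hypothetical; brute-force enumeration stands in for the constrained Viterbi searches used in practice):

```python
import itertools

# Toy lattice: 3 target positions, each with two candidate units,
# and hypothetical target costs per candidate.
target_cost = [
    {"a1": 1.0, "a2": 2.0},
    {"b1": 0.5, "b2": 0.3},
    {"c1": 1.2, "c2": 0.8},
]

def concat_cost(u, v):
    # Hypothetical join cost: free if the variant index matches, else 0.4.
    return 0.0 if u[-1] == v[-1] else 0.4

def path_cost(path):
    c = sum(target_cost[i][u] for i, u in enumerate(path))
    return c + sum(concat_cost(u, v) for u, v in zip(path, path[1:]))

paths = list(itertools.product(*[d.keys() for d in target_cost]))
global_best = min(path_cost(p) for p in paths)

def degradation(i, unit):
    """Cost of the best path forced through `unit` at position i,
    minus the globally optimal cost: the preselection criterion."""
    return min(path_cost(p) for p in paths if p[i] == unit) - global_best

# Prune units whose degradation exceeds a threshold (here 0.5).
kept = [(i, u) for i in range(3) for u in target_cost[i]
        if degradation(i, u) <= 0.5]
```

Because the degradation compares whole paths, a unit with a cheap target cost but expensive joins can still be pruned, which is the point of the criterion.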

#13 Line cepstral quefrencies and their use for acoustic inventory coding

Authors: Guntram Strecha ; Matthias Eichner ; Rüdiger Hoffmann

Line spectral frequencies (LSF) are widely used in the field of speech coding. Due to their properties, LSF are well suited to the quantisation and efficient compression of speech signals. In this paper we introduce line cepstral quefrencies (LCQ). They are derived from the cepstrum in the same manner as LSF are derived from linear predictive coding (LPC) features. We show that combining the pole-zero transfer function of the cepstrum with the properties of LSF offers advantages for speech coding. We apply the LCQ features to compress an acoustic inventory used for low-resource speech synthesis. The compression performance of the LCQ features is shown to be better than that of the LSF features in terms of mean spectral distance to the original inventory.

#14 Articulatory acoustic feature applications in speech synthesis

Authors: Peter Cahill ; Daniel Aioanei ; Julie Carson-Berndsen

The quality of unit selection speech synthesisers depends significantly on the content of the speech database being used. In this paper a technique is introduced that can highlight mispronunciations and abnormal units in the speech synthesis voice database through the use of articulatory acoustic feature extraction to obtain an additional layer of annotation. A set of articulatory acoustic feature classifiers help minimise the selection of inappropriate units in the speech database and are shown to significantly improve the word error rate of a diphone synthesiser.

#15 Approaches for adaptive database reduction for text-to-speech synthesis

Authors: Aleksandra Krul ; Géraldine Damnati ; François Yvon ; Cédric Boidin ; Thierry Moudenc

This paper raises the issue of speech database reduction adapted to a specific domain for Text-To-Speech (TTS) synthesis. We evaluate several methods: a database pruning technique based on the statistical behaviour of the unit selection algorithm, and a database adaptation method based on the Kullback-Leibler divergence. The aim of the former is to eliminate the least-selected units during synthesis of a domain-specific training corpus. The aim of the latter is to build a reduced database whose unit distribution approximates a given target distribution. We evaluate these methods on several objective measures.
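
The Kullback-Leibler criterion for the adaptation method can be sketched as follows (the unit inventory and probabilities are hypothetical; the paper's actual distributions come from its corpora):

```python
import math

# Hypothetical unit-type distributions over the same inventory:
# the domain-specific target vs. a candidate reduced database.
target = {"a": 0.5, "b": 0.3, "c": 0.2}
reduced = {"a": 0.4, "b": 0.4, "c": 0.2}

def kl_divergence(p, q):
    """D(p || q) in nats: how well the reduced database's unit
    distribution q approximates the target distribution p."""
    return sum(p[u] * math.log(p[u] / q[u]) for u in p if p[u] > 0)

score = kl_divergence(target, reduced)  # lower is a better match
```

A reduction procedure would compare candidate reduced databases by this score and keep the one closest to the target distribution.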

#16 Exploiting unlabeled internal data in conditional random fields to reduce word segmentation errors for Chinese texts

Authors: Richard Tzong-Han Tsai ; Hsi-Chuan Hung ; Hong-Jie Dai ; Wen-Lian Hsu

The application of text-to-speech (TTS) conversion has become widely used in recent years. Chinese TTS faces several unique difficulties. The most critical is caused by the lack of word delimiters in written Chinese. This means that Chinese word segmentation (CWS) must be the first step in Chinese TTS. Unfortunately, due to the ambiguous nature of word boundaries in Chinese, even the best CWS systems make serious segmentation errors. Incorrect sentence interpretation causes TTS errors, preventing TTS's wider use in applications such as automatic customer services or computer reader systems for the visually impaired. In this paper, we propose a novel method that exploits unlabeled internal data to reduce word segmentation errors without using external dictionaries. To demonstrate the generality of our method, we verify our system on the most widely recognized CWS evaluation tool - the SIGHAN bakeoff, which includes datasets in both traditional and simplified Chinese. These datasets are provided by four representative academies or industrial research institutes in HK, Taiwan, Mainland China, and the U.S. Our experimental results show that with only internal data and unlabeled test data, our approach reduces segmentation errors by an average of 15% compared to the traditional approach. Moreover, our approach achieves comparable performance to the best CWS systems that use external resources. Further analysis shows that our method has the potential to become more accurate as the amount of test data increases.

#17 On the role of spectral dynamics in unit selection speech synthesis

Authors: Barry Kirkpatrick ; Darragh O'Brien ; Ronán Scaife ; Andrew Errity

Cost functions employed in unit selection significantly influence the quality of speech output. Although unit selection can produce very natural sounding speech the quality can be inconsistent and is difficult to guarantee due to discontinuities between incompatible units. The join cost employed in unit selection to measure the suitability of concatenating speech units typically consists of sub costs representing the fundamental frequency and spectrum at the boundaries of each unit. In this study the role of spectral dynamics as a join cost in unit selection synthesis is explored. A number of spectral dynamic measures are tested for the task of detecting discontinuities. Results indicate that spectral dynamic measures correlate with human perception of discontinuity if the features are extracted appropriately. Spectral dynamic mismatch is found to be a source of discontinuity although results suggest this is likely to occur simultaneously with static spectral mismatch.
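
One form a spectral-dynamics join cost can take is a mismatch between delta features at the unit boundary. The sketch below uses hypothetical MFCC-like frames and a simple Euclidean distance; it is one plausible instance of the class of measures the paper studies, not its specific formulation:

```python
# Sketch of a spectral-dynamics join cost: compare delta (frame-to-
# frame difference) vectors at the boundary of two candidate units.

def deltas(frames):
    """First-order differences between consecutive feature frames."""
    return [[b - a for a, b in zip(f1, f2)]
            for f1, f2 in zip(frames, frames[1:])]

def dynamic_join_cost(left_unit, right_unit):
    """Euclidean distance between the last delta of the left unit
    and the first delta of the right unit."""
    dl = deltas(left_unit)[-1]
    dr = deltas(right_unit)[0]
    return sum((x - y) ** 2 for x, y in zip(dl, dr)) ** 0.5

left = [[1.0, 2.0], [1.2, 2.1], [1.4, 2.2]]  # rising trajectory
right_good = [[1.4, 2.2], [1.6, 2.3]]        # continues the trend
right_bad = [[1.4, 2.2], [0.8, 1.6]]         # abrupt reversal
```

Note that `right_good` and `right_bad` start from identical static frames, so a purely static spectral distance cannot distinguish them; only the dynamic measure penalizes the reversal.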

#18 ugloss: a framework for improving spoken language generation understandability

Authors: Brian Langner ; Alan W. Black

Understandable spoken presentation of structured and complex information is a difficult task to do well. As speech synthesis is used in more applications, there is likely to be an increasing requirement to present complex information in an understandable manner. This paper introduces

#19 Combination of LSF and pole based parameter interpolation for model-based diphone concatenation

Authors: Karl Schnell ; Arild Lacroix

For speech generation using small databases, spectral smoothing at the unit joints is necessary and can be realized by an interpolation of model parameters. For that purpose, the LSF are the best choice from the conventional parameter descriptions. This contribution shows how LSF interpolations can be improved using poles as parameters. The problem of the pole assignment between the two pole configurations at the unit joints is solved by pole tracking of an LSF transition. An inspection of the assignments determined by LSF transitions reveals unfavorable cases which can be corrected. A comparison between the LSF and the pole based interpolations shows that the LSF interpolations can be improved by the corrected pole assignments and by the trajectories of the poles. The investigations are performed using a diphone database which is analyzed by an extended LPC model in lattice structure including vocal tract losses.
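
The baseline operation, interpolating model parameters across a unit joint, can be sketched with plain linear interpolation of LSF vectors (hypothetical values; the paper's contribution of pole tracking and assignment correction is not shown here):

```python
# Linear interpolation between two LSF vectors over a short
# transition at a diphone joint.

def interpolate_lsf(lsf_a, lsf_b, steps):
    """Return `steps` intermediate LSF vectors from lsf_a to lsf_b."""
    out = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)
        out.append([(1 - t) * a + t * b for a, b in zip(lsf_a, lsf_b)])
    return out

# Two LSF frames as ordered normalized frequencies.
frames = interpolate_lsf([0.10, 0.25, 0.40], [0.14, 0.29, 0.48], 3)
```

A useful property motivating LSF for this task: since both endpoint vectors are sorted, every convex combination is also sorted, so the interpolated frames keep the ordering that filter stability requires.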

#20 Automatic building of synthetic voices from large multi-paragraph speech databases

Authors: Kishore Prahallad ; Arthur R. Toth ; Alan W. Black

Large multi-paragraph speech databases encapsulate prosodic and contextual information beyond the sentence level that could be exploited to build natural sounding voices. This paper discusses our efforts on the automatic building of synthetic voices from large multi-paragraph speech databases. We show that the primary issue, segmentation of large speech files, can be addressed with modifications to the forced-alignment technique, and that the proposed technique is independent of the duration of the audio file. We also discuss how this framework could be extended to build a large number of voices from public-domain multi-paragraph recordings.

#21 Automatic phonetic segmentation of Spanish emotional speech

Authors: A. Gallardo-Antolín ; R. Barra ; Marc Schröder ; Sacha Krstulović ; J. M. Montero

Unit selection is the state-of-the-art technique for achieving high-quality synthetic emotional speech. Nevertheless, it requires a large, expensive, phonetically-segmented corpus, so cost-effective automatic techniques should be studied. The HMM experiments in this paper show that: segmentation performance can depend heavily on the segmental or prosodic nature of the intended emotion (segmental emotions are more difficult to segment than prosodic ones); several emotions should be combined to obtain a larger training set (especially when prosodic emotions are involved and training sets are small); and combining emphatic and non-emphatic emotional recordings (short sentences vs. long paragraphs) can degrade overall performance.

#22 Iterative unit selection with unnatural prosody detection

Authors: Dacheng Lin ; Yong Zhao ; Frank K. Soong ; Min Chu ; Jieyu Zhao

Corpus-driven speech synthesis is hampered by occasional glitches that ruin the impression of the whole utterance. We propose an iterative unit selection integrated with an unnatural prosody detection model to identify unnatural prosody. The system searches for an optimal path in the lattice, verifies its naturalness with the unnatural prosody model, and replaces the offending section with a better candidate until the utterance passes the verification test. In light of hypothesis testing, we show that this trial-and-error approach takes effective advantage of the abundant candidate samples in the database. Moreover, in contrast to conventional prosody prediction, an unnatural prosody detection model still leaves enough room for prosodic variation. Unnaturalness confidence measures are studied. The combined model reduces objective distortion by 16.3%, and perceptual experiments confirm that the proposed approach appreciably improves synthetic speech quality.
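
The select-verify-replace loop can be sketched as follows. The candidate data, the per-slot back-off, and the string-match "detector" are all stand-ins for illustration; the paper's detector is a statistical model over prosodic features:

```python
# Minimal sketch of the trial-and-error loop: pick the lowest-cost
# candidate per slot, run a stand-in unnaturalness detector, and
# retry flagged slots with the next-best candidate.

candidates = [
    [("u1", 1.0), ("u2", 1.5)],   # slot 0: (unit, cost)
    [("bad", 0.2), ("u3", 0.9)],  # slot 1: cheapest unit is unnatural
    [("u4", 0.7), ("u5", 1.1)],
]

def is_unnatural(unit):
    """Stand-in for the unnatural prosody detection model."""
    return unit == "bad"

def iterative_select(candidates, max_rounds=5):
    ranked = [sorted(c, key=lambda x: x[1]) for c in candidates]
    picks = [0] * len(ranked)
    for _ in range(max_rounds):
        units = [ranked[i][picks[i]][0] for i in range(len(ranked))]
        flagged = [i for i, u in enumerate(units) if is_unnatural(u)]
        if not flagged:
            return units
        for i in flagged:  # back off to the next-best candidate
            if picks[i] + 1 < len(ranked[i]):
                picks[i] += 1
    return units

print(iterative_select(candidates))  # -> ['u1', 'u3', 'u4']
```

The loop accepts a slightly costlier unit only where the detector objects, which is why it preserves prosodic variation better than re-predicting prosody for the whole utterance.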